DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce
Abstract
Hierarchical clustering has been widely used in numerous applications due to its informative representation of clustering results. However, its high computation cost and inherent data dependency prevent it from running efficiently on large datasets. In this paper, we present a distributed single-linkage hierarchical clustering algorithm (DiSC) based on MapReduce, one of the most popular programming models used for scalable data analysis. The main idea is to divide the original problem into a set of overlapping subproblems, solve each subproblem, and then merge the sub-solutions into an overall solution. Further, our algorithm is flexible enough to be used in practice, since configurable parameters for the data merge phase let it run in a fairly small number of MapReduce rounds. In our experiments, we evaluate the DiSC algorithm using synthetic datasets of varied size and dimensionality, and find that DiSC provides a scalable speedup of up to 160 on 190 computer cores.
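The divide, solve, and merge idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation and uses no MapReduce framework. It assumes the well-known equivalence between single-linkage clustering and the minimum spanning tree (MST) of the complete distance graph, which the abstract itself does not mention. Under that assumption, the points are split into partitions, each pair of partitions forms one overlapping subproblem, a local MST is built per subproblem, and the local trees are merged with a Kruskal-style pass. The names `local_mst`, `kruskal_merge`, `divide_and_merge_mst`, and `num_parts` are hypothetical and chosen only for this example.

```python
from itertools import combinations
import math


def local_mst(points, ids):
    """Prim's algorithm on the complete Euclidean graph over points[ids]."""
    if len(ids) < 2:
        return []
    # best[i] = (distance to the growing tree, tree vertex achieving it)
    best = {i: (math.inf, None) for i in ids[1:]}
    current = ids[0]
    edges = []
    while best:
        for i in best:
            d = math.dist(points[current], points[i])
            if d < best[i][0]:
                best[i] = (d, current)
        nxt = min(best, key=lambda i: best[i][0])
        d, parent = best.pop(nxt)
        edges.append((d, min(parent, nxt), max(parent, nxt)))
        current = nxt
    return edges


def kruskal_merge(n, edges):
    """Merge candidate edges into one spanning tree using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for d, u, v in sorted(set(edges)):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((d, u, v))
    return tree


def divide_and_merge_mst(points, num_parts):
    """Hypothetical driver: split, solve pairwise subproblems, merge."""
    n = len(points)
    parts = [list(range(k, n, num_parts)) for k in range(num_parts)]
    if num_parts == 1:
        subproblems = parts
    else:
        # Every pair of partitions is one overlapping subproblem, so each
        # edge of the complete graph appears in at least one subproblem.
        subproblems = [a + b for a, b in combinations(parts, 2)]
    candidates = []
    for ids in subproblems:  # stands in for independent map tasks
        candidates.extend(local_mst(points, ids))
    return kruskal_merge(n, candidates)  # stands in for the merge phase


points = [(0, 0), (1, 0), (5, 0), (6, 0), (20, 0)]
tree = divide_and_merge_mst(points, num_parts=3)
merge_heights = [d for d, _, _ in tree]
```

Under the MST assumption, the edge weights in `merge_heights`, read in increasing order, are the distances at which single-linkage merges clusters. The pairwise split works because an edge discarded by a local MST is the heaviest edge on some cycle and therefore cannot be required by the global tree.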
Similar resources
Incremental, distributed single-linkage hierarchical clustering algorithm using mapreduce
Single-linkage hierarchical clustering is one of the prominent and widely-used data mining techniques for its informative representation of clustering results. However, the parallelization of this algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. Moreover, in many modern applications, new data is continuously added into the already huge ...
Choosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation
The assessment of watershed sediment load is necessary for controlling soil erosion and reducing the potential for sediment production. Differing estimates of sediment amounts, along with the lack of long-term measurements, limit access to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...
Hierarchical clustering of large text datasets using Locality-Sensitive Hashing
In this paper, we present a hierarchical clustering algorithm for large text datasets using Locality-Sensitive Hashing (LSH). The main idea of LSH is to "hash" items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. The main drawback of the conventional hierarchical algorithms is a high time complexity (e.g. Single Linka...
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework
Large datasets, of the order of peta- and terabytes, are becoming prevalent in many scientific domains including astronomy, physical sciences, bioinformatics and medicine. To effectively store, query and analyze these gigantic repositories, parallel and distributed architectures have become popular. Apache Hadoop is a distributed file system that provides support for data-intensive applications. I...
Finding Connected Components on Map-reduce in Logarithmic Rounds
Given a large graph G = (V,E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph, and by n the number of nodes in the largest component, all prior techniques for...
Publication date: 2013